Globy webcrawlers
Scrape the world with Globy!
Usage
$ globy-scraper.py -h
usage: globy-scraper.py [-h] -f URLS_FILE [-l LOGLEVEL] [-o OUTPUT_FILE] [-d] [-s] [-b BACKEND]
options:
-h, --help show this help message and exit
-f URLS_FILE, --urls_file URLS_FILE
Newline separated file of URLs to scrape
-l LOGLEVEL, --loglevel LOGLEVEL
Set log level (1-3)
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Output CSV file path
-d, --debug Debug/inspect responses (Will set PYTHONINSPECT to True)
-s, --store_html_to_file
Store all HTML content to file per domain in the "html_output" folder
-b BACKEND, --backend BACKEND
Select backend to use for scraping: "globy" (default) or "scrapy".
(!) The scrapy backend will only dump data to the "html_output" folder for now. No website analysis or other functionality is supported yet.
Quickstart
You can just run globy-scraper as a script (no installation needed):
pip3 install -r requirements.txt # Install dependencies if not already installed
./globy-scraper.py -f wordpress-top50.txt
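The -f option expects a plain-text file with one URL per line. Below is a minimal sketch for generating such a file from a list of URLs. The helper name write_urls_file, the output file name, and the placeholder URLs are illustrative assumptions and not part of Globy; the filtering to absolute http(s) URLs is likewise an assumption about what is useful input, not documented scraper behaviour.

```python
from pathlib import Path
from urllib.parse import urlparse


def write_urls_file(urls, path):
    """Write unique http(s) URLs to `path`, one per line; return the count written.

    Hypothetical helper for illustration, not part of Globy.
    """
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        parsed = urlparse(url)
        # Keep only absolute http(s) URLs and skip duplicates.
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            lines.append(url)
    # One URL per line, each terminated by a newline.
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    # Placeholder URLs and file name; replace with your own.
    count = write_urls_file(
        ["https://example.com/", "https://example.org/"], "my-urls.txt"
    )
    print(f"Wrote {count} URLs")
```

The resulting file can then be passed to the scraper with ./globy-scraper.py -f my-urls.txt.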
Install the globy-webcrawlers package
You can install the package for development and for use with other Globy projects. It's recommended to set up a Python virtual environment before installing the package.
pip3 install -e .
Now you can use the package in Python:
>>> from globy_webcrawlers.crawler import WebSiteDataCrawler
>>> c = WebSiteDataCrawler()
>>> c.load_urls_from_file("urls.txt")
>>> c.run()
After installing the package, you can also run the scraper directly: globy-scraper.py -h
Debugging/inspecting website content
Because the crawler is asynchronous, debugging responses can be tricky. The -d flag makes this easier: it sets PYTHONINSPECT, so Python drops into an interactive prompt once the run finishes, where you can inspect the HTML content and internal objects of the most recently crawled website. Here's an example:
$ ./globy-scraper.py -f wordpress-top50.txt -d
>>> w = crawler.debug_latest_response()
>>> w.url
'https://www.thyroid.org/'
>>> len(w.html_content)
249167
>>> w.get_website_info()
To debug a different response, change crawler.urls or edit the input URL list.
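During an inspection session, a small helper can give a quick overview of a response. The sketch below relies only on the url and html_content attributes shown in the session above. summarize_response is a hypothetical helper and not part of the Globy API, and a SimpleNamespace stands in for the real response object so the snippet runs on its own.

```python
import re
from types import SimpleNamespace


def summarize_response(w):
    """Return a small dict describing an object with `url` and `html_content`.

    Hypothetical helper for illustration, not part of Globy.
    """
    html = w.html_content or ""
    # Naive <title> extraction; good enough for a quick look while debugging.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {
        "url": w.url,
        "html_length": len(html),
        "title": match.group(1).strip() if match else None,
    }


# Stand-in object for illustration. In a -d session, pass the object
# returned by crawler.debug_latest_response() instead.
fake = SimpleNamespace(
    url="https://example.com/",
    html_content="<html><head><title> Example </title></head></html>",
)
print(summarize_response(fake))
```

The regular expression is deliberately simple; for anything beyond a quick look, a real HTML parser would be a more robust choice.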
Development only
Build the Python package:
python3 -m build